State Occupancy Information
for Performance Comparisons

G.E.  Lyon 


A state-based  performance  characterization  attaches  fixed  processing  rates  to  each 
service  state.  However,  the  number  of  states  can  be  large.  Over  time,  the  sequences  of  such 
states are enormous. Counts of active (but interchangeable) system elements define
macrostates,  which  are  fewer.  Furthermore,  only  the  occupancy  levels  of  macrostates  are 
recorded.  This  removes  time  sequencings  as  a combinatoric  problem,  but  still  captures 
general  performance  details.  Applications  can  be  compared  independently  of  their 
algorithmic  structures. 

The  hypercube  and  other  distributed-memory  systems  bring  both  opportunities  and 
challenges  to  a state-based  approach.  Certainly,  processor  and  communication  activities  are 
more  easily  identified  and  more  independent  with  distributed-memory  than  with  shared- 
memory.  But  isolated  nodes  also  entail  problems  in  capturing  global  observation  states. 
However,  hypercube  application  codes  are  commonly  homogeneous  across  nodes,  so  that 
aggregating  local  state  information  works  well.  Three  paradigms  illustrate  homogeneous 
applications with communication dependencies that are strong (global), moderate (local), or
weak  (independent  once  spawned).  The  three  performance  summaries  are  accurate  and 
extremely  compact. 

Key  words:  application  comparisons;  distributed-memory;  performance;  system  states. 


Users want to compare accurately the general performance of applications on newer computer
architectures, and they want to do so without the tremendous complications of algorithmic structures or explicit
time.  Benefits  of  such  an  approach  are  clear:  users  save  time  and  reduce  risks  of  bad  program-architecture 
matching;  they  build  insight  on  the  system  needs  of  classes  of  programs;  their  overall  picture  of 
computations  is  simpler.  Certainly,  a simplified  perspective  of  applications  on  a system  encourages 
comparisons  among  them.  Since  all  programs  on  a system  use  its  common  and  available  resources,  a 
characterization  at  the  system-resource  level  can  support  application  comparisons. 

The  discussion  assumes  that  collection  accuracy  is  not  a problem,  even  though  on  parallel  processing 
systems,  measurement  perturbation  can  cause  severe  distortions  in  performance  behavior.  Measurements 
cause  insignificant  perturbation  on  the  example  system,  which  has  custom  hardware 
instrumentation [MCNR90]. The results indicate that in many circumstances the burden of
instrumentation may be surprisingly light.


Performance Characterization via State Occupancies


The  programmer  doing  algorithm  measurements  and  comparisons  is  constantly  reminded  that  there 
are  many  distinct  algorithms.  And,  while  computer  systems  are  potentially  as  varied  as  the  programmed 
applications that run on them, few computer designs are actually built. This fact serves nicely to limit
discourse.  Keeping  a focus  upon  real  systems,  the  following  approach  is  used: 


1. Decompose each system into major service states that determine general performance. A given
   class of machines is treated as having but a limited set of dominant states, each state denoting
   a fixed set of processing capacities (rates).

2. View each application as a set of demands upon system service capacities (rates), with a
   specific application demand signified by an accumulative occupancy in the corresponding
   system state.

3.  Build  and  interpret  models  based  upon  observed  state  occupancies  and  the  associated  rates. 

The  idea  of  observing  service  states  enjoys  simple  but  powerful  advantages.  The  entire  state  space  of 
responses  is  described.  This  space  is  closed,  rather  than  open,  and  thus  checks  the  resolution  and 
consistency  of  measurements.  Changes  to  input  lead  only  to  a redistribution  of  occupancies  among  service 
states.  There  arise  no  completely  new  behaviors.  As  a result,  state-based  measurement  has  structure.  The 
fineness  and  number  of  the  observation  states  hinge  upon  the  resolution  needed  to  explain  system 
performances  as  input  varies.  Specific  challenges  include  the  need  to  ensure  that  states  are  succinctly 
defined  and  recorded,  to  manage  a large  number  of  states  as  parallel  systems  scale  up  in  number  of 
processors,  and  to  compress  state  representations  as  systems  do  scale  in  size.  Averages  from  local  state 
information  fall  in  the  last  category,  and  are  discussed  later  for  homogeneous  codes  on  a (homogeneous) 
hypercube. 


Comparison  with  Related  Work 


Recent  workshop  discussions  with  R.  Saavedra-Barrara  (U.  CA.)  and  E.  Miya  (NASA)  highlight 
differences  in  methodology  (see  table,  below)  between  their  efforts  and  the  above 
approach  [LYO90,  SSM89].  First,  they  initiate  application  characterization  at  the  language  level— 
FORTRAN;  their  model  (now)  has  113  specific  parameters,  which  reduce  to  perhaps  14  or  more  general 
factors  [SSM89].  Their  virtual  FORTRAN  machine  model  invites  problems  in  performance  variance 
acquired  through  layers  of  construction  [LS90].  The  model’s  level  of  abstraction  is  perhaps  too  high.  In 
contrast,  service  states  at  a system-level  characterization  may  generate  only  4-10  parameters.  (Later 
examples  use  four.)  Hardware  developed  at  NIST  captures  service  state  occupancies  for  our  experiments; 
our  group’s  advantage  in  making  specialized  instrumentation  is  pronounced.  However,  many 
instrumentations  make  only  modest  demands,  such  as  a fast  clock  and  several  timer  registers  on  each 
processor  node.  Secondly,  the  Berkeley-NASA  researchers  choose  their  dependent  variables  principally 
as  statistical  predictors  on  mostly  serial  machines,  whereas  our  NIST  observables  are  tied  to  true  system 
states (actually, macrostates) and are 100% measured. On parallel architectures this difference can be
very significant. For example, program communication latencies are much easier to measure (NIST
approach) than to predict (Berkeley-NASA). Also, a state-based model has a formal structure that can be
manipulated to advantage [LYO89]. Both approaches rely upon accumulative effects (e.g., time measured
overall) to remove the combinatorics of sequences (e.g., states in time).


                                                Approach
Category                Berkeley-NASA [SSM89]             NIST [LYO89, LS90]

Objective               compare and contrast              compare applications concisely
                        architectures via standard        via major service states
                        FORTRAN virtual machine           appropriate for each
                        (includes applications)           architectural family

Application Treatment   static: analyze source            dynamic: instrument, execute

Host Treatment          run "kernel" benchmarks           special measurement hardware
                        to establish parameters           captures information

Modeling Effort         complex: includes compiler        simple if have well-chosen
                        actions, system behavior          state variables

Everyday Use            easy, automatic                   easy: system service statistics
                                                          harder: specialized application data

Added Hardware          none                              custom instrumentation

Support Software        customized for each system        standard packages
                        component (compiler,
                        scheduler,...)

Philosophy Bias         analytical                        empirical

                        Contrast in Approaches


Other Representations. States present another issue, that of practical representation. State tables shown
in the sequel are such that every attribute (column) appears in every state (row). However, when a column has
little effect on most states (rows), it is convenient to select all unaffected rows, and to project a new
subtable that eliminates the column. Repeated applications eventually yield many small subtables, none of
which is complete in itself. Under these circumstances, a dependency tree works considerably better: it
succinctly displays isolated clusters of interaction that leave other states essentially
unaffected [LYO89, LS90]. However, in the context to follow, a table format is adequate.


States  and  Macrostates 


Discussion  of  states  and  their  role  proceeds  via  a simple  example.  This  should  cause  no  loss  of 
generality. Imagine a parallel system ABCD with four (p = 4) processors: a, b, c, and d. The architecture
for ABCD remains for the moment unspecified. At any time, each processor is attributed to one of three
(m = 3) major modes, A1, A2, or A3. A concise description of system ABCD at time t_i is given as the
microstate

        A1    A2    A3    Rate
t_i:    a     bc    d     r*(a, bc, d)

Microstate signifies that the detail of individual system elements is recorded. Rate r*( ) provides values for
processor a in A1, b and c in A2, and d in A3. This fully characterizes the performance of ABCD at time t_i.
Table T1 shows a time trace of ABCD microstates. Time is defined by fields index, for order of microstate
occurrence, and Duration, for length of stay. The symbol "∅" denotes the empty set.

index   A1     A2     A3     Duration   Rate
1       abcd   ∅      ∅      3          r*(abcd, ∅, ∅)
2       acd    b      ∅      1          r*(acd, b, ∅)
3       abd    c      ∅      2          r*(abd, c, ∅)
4       acd    ∅      b      1          r*(acd, ∅, b)
5       abcd   ∅      ∅      1          r*(abcd, ∅, ∅)
6       ab     cd     ∅      5          r*(ab, cd, ∅)
7       cd     ab     ∅      5          r*(cd, ab, ∅)
8       ac     bd     ∅      5          r*(ac, bd, ∅)
9       ∅      abcd   ∅      1          r*(∅, abcd, ∅)
10      abcd   ∅      ∅      2          r*(abcd, ∅, ∅)
11      a      bcd    ∅      2          r*(a, bcd, ∅)
12      ∅      abc    d      1          r*(∅, abc, d)

T1: System ABCD Time Trace, Microstates

Microstates  often  contain  much  unnecessary  detail.  If  ABCD  is  a good  general-purpose  design,  all 
processing  nodes  likely  have  the  same  interconnect,  the  same  processor  chip,  and  the  same  memory 
architecture.  (It  simplifies  manufacture,  programming  and  maintenance.)  Since  no  component  is 
distinguished, counts of processors in a mode are what matter. Thus, the earlier microstate example

        A1    A2    A3    Rate
t_i:    a     bc    d     r*(a, bc, d)

becomes the macrostate

t_i:    1     2     1     r(1, 2, 1)

Table T2 shows that converting T1 to macrostates produces some adjacent rows that are identical except
for index and duration (rows 2-3 and 6-8). These groupings can be merged, adding their durations
together. Rate r now takes integer arguments, rather than the sets of r*.


index   A1   A2   A3   Duration   Rate
1       4    0    0    3          r(4, 0, 0)
2,3     3    1    0    3          r(3, 1, 0)
4       3    0    1    1          r(3, 0, 1)
5       4    0    0    1          r(4, 0, 0)
6,7,8   2    2    0    15         r(2, 2, 0)
9       0    4    0    1          r(0, 4, 0)
10      4    0    0    2          r(4, 0, 0)
11      1    3    0    2          r(1, 3, 0)
12      0    3    1    1          r(0, 3, 1)

T2: System ABCD Time Trace, Macrostates
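
The conversion from T1 to T2 can be sketched in a few lines of Python. This is an illustrative sketch only: the tuple layout and the function name `to_macrostates` are assumptions made for the example, not part of the original instrumentation.

```python
from itertools import groupby

# T1 time trace: (index, A1 set, A2 set, A3 set, duration).
# An empty string stands for the empty set.
T1 = [
    (1, "abcd", "", "", 3),
    (2, "acd", "b", "", 1),
    (3, "abd", "c", "", 2),
    (4, "acd", "", "b", 1),
    (5, "abcd", "", "", 1),
    (6, "ab", "cd", "", 5),
    (7, "cd", "ab", "", 5),
    (8, "ac", "bd", "", 5),
    (9, "", "abcd", "", 1),
    (10, "abcd", "", "", 2),
    (11, "a", "bcd", "", 2),
    (12, "", "abc", "d", 1),
]


def to_macrostates(trace):
    """Replace each processor set by its size, then merge adjacent equal rows."""
    counted = [((len(a1), len(a2), len(a3)), idx, dur)
               for idx, a1, a2, a3, dur in trace]
    merged = []
    # groupby only groups *adjacent* rows, so time order is preserved.
    for counts, rows in groupby(counted, key=lambda row: row[0]):
        rows = list(rows)
        indices = [idx for _, idx, _ in rows]
        merged.append((indices, counts, sum(dur for _, _, dur in rows)))
    return merged


T2 = to_macrostates(T1)
for indices, counts, duration in T2:
    print(indices, counts, duration)
```

Because only adjacent rows are merged, the three separate visits to macrostate (4, 0, 0) remain distinct rows at this stage.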

The next simplification introduces macrostate occupancy, in contrast to sequences of macrostates
in time. Selecting all T2 rows, project T2 on all columns except index. Excluding Duration, entries in the
new table (T3) with duplicate fields merge into single rows, their Duration fields summing for a new
Occupancy field. Now, while the set of macrostate sequences is infinite, the number of macrostates for ABCD with
p = 4, m = 3 is [FEL64]:


\[ \binom{p+m-1}{p} = \binom{6}{4} = 15 \]
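
As a quick check of the count, the following sketch evaluates the binomial coefficient and also enumerates the macrostates directly. The function names are hypothetical and chosen only for this example.

```python
from itertools import combinations_with_replacement
from math import comb


def count_macrostates(p, m):
    """Number of ways to distribute p interchangeable processors over m modes."""
    return comb(p + m - 1, p)


def enumerate_macrostates(p, m):
    """List every macrostate as a tuple of per-mode processor counts."""
    states = set()
    # Each multiset of p mode labels corresponds to exactly one macrostate.
    for assignment in combinations_with_replacement(range(m), p):
        states.add(tuple(assignment.count(mode) for mode in range(m)))
    return sorted(states, reverse=True)


print(count_macrostates(4, 3))            # system ABCD
print(len(enumerate_macrostates(4, 3)))   # same count, by enumeration
print(count_macrostates(16, 4))           # 16 nodes, four states per node
```

The last line hints at how quickly the count grows with system size, which motivates the further compressions discussed later.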

The  transformation  to  occupancies  removes  explicit  time  and  the  complications  of  algorithmic 
sequencings  from  the  performance  characterization.  Useful  remaining  aspects  of  the  application  code 
include  its  independent  variables  and  their  settings  during  performance  experiments.  The  result  is  table 
T3. (T3 does not show settings of the independent variables that generated it.)


A1   A2   A3   Occupancy   Rate
4    0    0    6           r(4, 0, 0)
3    1    0    3           r(3, 1, 0)
3    0    1    1           r(3, 0, 1)
2    2    0    15          r(2, 2, 0)
0    4    0    1           r(0, 4, 0)
1    3    0    2           r(1, 3, 0)
0    3    1    1           r(0, 3, 1)

T3: System ABCD Occupancies, Macrostates of A1 A2 A3
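
The step from the macrostate time trace to the occupancy table amounts to summing durations per macrostate. A minimal sketch, assuming the trace is held as (macrostate, duration) pairs (a layout chosen here for illustration):

```python
from collections import Counter

# Macrostate time trace (T2): ((A1, A2, A3), duration) in time order.
T2 = [
    ((4, 0, 0), 3), ((3, 1, 0), 3), ((3, 0, 1), 1),
    ((4, 0, 0), 1), ((2, 2, 0), 15), ((0, 4, 0), 1),
    ((4, 0, 0), 2), ((1, 3, 0), 2), ((0, 3, 1), 1),
]


def occupancies(trace):
    """Drop the time order and sum the durations of identical macrostates."""
    table = Counter()
    for macrostate, duration in trace:
        table[macrostate] += duration
    return dict(table)


T3 = occupancies(T2)
for macrostate, occupancy in T3.items():
    print(macrostate, occupancy)
```

The total occupancy still equals the elapsed time of the trace; only the ordering information has been discarded.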

In T4 (below), all rows of T3 have been selected. A projection on A1 also sums the occupancies of
duplicate rows. As an example, T3 entries

A1   A2   A3   Occupancy   Rate
0    4    0    1           r(0, 4, 0)
0    3    1    1           r(0, 3, 1)

become an entry

A1   Occupancy   Rate
0    2           r1(0)

in table T4. The transformation assumes either that A2 and A3 are inconsequential to gross system
performance and are ignored, or that their information is implicit in values taken by r1( ). A1 is commonly
level-of-parallelism for many parallel system analyses.

A1   Occupancy   Rate
4    6           r1(4)
3    4           r1(3)
2    15          r1(2)
1    2           r1(1)
0    2           r1(0)

T4: System Occupancies, Macrostates of A1
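
The projection onto A1 can be sketched as follows. The dictionary layout and the helper name `project` are assumptions for the example.

```python
from collections import defaultdict

# Occupancy table T3: (A1, A2, A3) -> occupancy.
T3 = {
    (4, 0, 0): 6, (3, 1, 0): 3, (3, 0, 1): 1, (2, 2, 0): 15,
    (0, 4, 0): 1, (1, 3, 0): 2, (0, 3, 1): 1,
}


def project(table, keep):
    """Keep only the listed attribute positions, summing duplicate rows."""
    projected = defaultdict(int)
    for macrostate, occupancy in table.items():
        projected[tuple(macrostate[i] for i in keep)] += occupancy
    return dict(projected)


T4 = project(T3, keep=(0,))   # project on A1 only
for (a1,), occupancy in sorted(T4.items(), reverse=True):
    print(a1, occupancy)
```

The same helper projects on any other subset of attributes, for example `keep=(0, 1)` for A1 and A2 together.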


Further  compressions  of  state  information  are  often  necessary.  For  despite  simplifications  already 
introduced,  the  number  of  macrostates  (expressed  by  the  binomial  coefficient)  grows  swiftly  as  systems 
scale  up  in  processors.  Often,  a continuous  function  can  be  fit  to  give  a compact  approximation  to  a given 
attribute (column entries). The best-known example is probably the Hockney-Jesshope vector rate of

\[ r_1(n) = \frac{r_\infty}{1 + n_{1/2}/n} \]

associated with attribute A1 = vector length, n. As mentioned earlier, the practical
representation of states is a somewhat distinct topic. See [LS90] for examples.
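
To illustrate how one continuous function can stand in for a whole column of rate entries, the sketch below evaluates the standard Hockney-Jesshope form r(n) = r_inf / (1 + n_half / n). The numerical values of `R_INF` and `N_HALF` are hypothetical and are not taken from any measured system.

```python
def hockney_rate(n, r_inf, n_half):
    """Vector rate for length n: r_inf / (1 + n_half / n)."""
    if n <= 0:
        raise ValueError("vector length must be positive")
    return r_inf / (1.0 + n_half / n)


R_INF = 100.0   # hypothetical asymptotic rate
N_HALF = 20.0   # hypothetical vector length giving half of R_INF

for n in (1, 20, 200, 2000):
    print(n, round(hockney_rate(n, R_INF, N_HALF), 2))
```

Two parameters replace an arbitrarily long table of per-length rates, which is the kind of compression the text describes.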


Local  State  Information.  A more  serious  problem  arises  when  immediate  global  knowledge  of 
processors  or  other  node  activity  is  lacking  (each  column  proclaims  a known  global  level  of  activity  for  its 
attribute).  Precise  global  timing  is  often  not  available  on  distributed  systems.  (Our  NIST-instrumented 
iPSC-1 does global timings.) An obvious approach is to examine homogeneous collections of nodes:
those  identical  in  hardware  and  in  assigned  software  tasks.  This  is  not  unrealistic,  for  most  applications  on 
hypercubes  and  related  architectures  are  regular  problem  domains  attacked  via  code  replicated  across 
nodes  [FJL88].  The  regularity  of  hypercube  architecture  invites  such  solutions.  Observations  made 
independently  at  each  node  are  averaged  together.  Overall  application  consumption  of  system  time  is 
recorded  at  detail  levels  appropriate  for  nodes  of  the  machine.  The  approach  scales  well,  thereby 
matching  a strength  of  the  architecture.  Performance  results  are  given  as  an  ideal  (or  mean)  performance 
on one node. For example ABCD, mean occupancies in A1, A2, and A3 are available by calculation from
previous tables. For A1 of table T4 there is a mean node occupancy of

\[ \tfrac{4}{4}\cdot 6 + \tfrac{3}{4}\cdot 4 + \tfrac{2}{4}\cdot 15 + \tfrac{1}{4}\cdot 2 = 17 \]

In  practice  occupancy  times  are  collected  separately  at  each  node  and  then  averaged.  The  result  is  the 
same,  but  unlike  using  table  T3  or  T4,  no  global  information  is  used  until  the  final  averaging. 
Instrumentation collection demands are consequently much lighter. Attributes A_i take only the values 0
and 1. Let o_i(1) denote the occupancy time of A_i at value 1. Total elapsed time T is
T = o_1(1) + o_2(1) + o_3(1), since a node of system ABCD is always in one (and only one) of the A_i.
Consequently, T - o_i(1) = o_i(0), and only the o_i(1) need be recorded. Averaging across nodes for means:


o_1(1)   o_2(1)   o_3(1)
17.0     11.5     0.5

T5: Mean Node Occupancies for A1, A2, A3
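
The claim that local collection gives the same means as the global tables can be checked on the ABCD example. In this sketch each node accumulates only its own occupancy times, and averaging happens once at the end. The data layout and function name are assumptions for illustration.

```python
# T1 microstate trace: (A1 set, A2 set, A3 set, duration); "" is the empty set.
T1 = [
    ("abcd", "", "", 3), ("acd", "b", "", 1), ("abd", "c", "", 2),
    ("acd", "", "b", 1), ("abcd", "", "", 1), ("ab", "cd", "", 5),
    ("cd", "ab", "", 5), ("ac", "bd", "", 5), ("", "abcd", "", 1),
    ("abcd", "", "", 2), ("a", "bcd", "", 2), ("", "abc", "d", 1),
]
NODES = "abcd"


def local_occupancies(trace, node):
    """Occupancy time in each mode as seen by one node's own timers."""
    totals = [0, 0, 0]
    for *modes, duration in trace:
        for i, members in enumerate(modes):
            if node in members:
                totals[i] += duration
    return totals


# Each node records locally; global information is used only for the final mean.
per_node = {node: local_occupancies(T1, node) for node in NODES}
means = [sum(per_node[node][i] for node in NODES) / len(NODES) for i in range(3)]

print(per_node)
print(means)
```

Every node's three occupancies sum to the same elapsed time T, so any one of them could be dropped and recovered by subtraction.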

Measuring or Estimating. Simple reasoning about states leads to the above. While measurements taken
at one randomly selected node could also establish occupancy estimates, these entail more uncertainty:
overall service is estimated, based upon the chosen node being fully representative. Measuring the
full application service by summing local measurements ensures that incidental variations among nodes
have much diminished influences. Consequently, one role of any line of argument is to show what system
state details have been lost or preserved, and at what cost. As for mean node occupancies (e.g., in T5),
homogeneous applications occur frequently enough to render the approach worthwhile,
provided that it is accurate. Three general examples provide evidence that it is.


Mean  Node  Occupancies  and  Three  Hypercube  Paradigms 


The experiments with local states involve our NIST specially-instrumented iPSC-1 Hypercube
system. A choice of three synthetic communication benchmarks underscores that (i) iPSC-1 node
performance is relatively simple, and (ii) inter-node communication is a dominant performance factor.
The benchmarks explore points along a spectrum of communication interdependencies [LYO89b]. In
Random, tasks proceed independently once each is spawned. Radiation transport has this property. Mesh
simulates locally-dependent processes typical of fluid models. Ring has each node dependent upon all
others for the next time-step calculation. Molecular dynamics might approach this global level of process
dependency. The analysis yields a fresh look at original data gathered in 1989 by R. Snelick [SNE89] to
the following specifications:


                                          Application Paradigms
                                  B1-Random     B2-Mesh       B3-Ring

Distinct Trial Parameter Sets     60            75            75

Range, Elapsed Times              8.4-202 s     4.2-146 s     3.1-789 s

Independent Variables             Ranges of Independent Variables
  exit rate, e                    75-225        n.a.          n.a.
  nodes, n                        4-16          9-16          4-16
  packets/msg, p                  fixed (1)     1-16          1-16
  grain, g                        200-1000      20-100        20-100

Specifications for Benchmark Trials [from SNE89]

There are m = 4 dependent variables (observables) that identify critical states of each node:
communication interrupt service (I), remainder of system service and user-mode (U), message-sending (S),
and message-receiving (R). These are the A_i. (An iPSC-1 node has no separate communication processor.)
While further details of node state would be informative (such as synchronous or asynchronous
communication), the four rudimentary states are quite serviceable. Our 16 processor system has 16
corresponding instrumentation boards that capture node state information.

Mean node state occupancies for I, S, R, and U are recorded along with the independent parameter
settings. For example, Ring's independent parameters are n, p, and g. These respectively denote the
number of nodes, packets per message, and computational grain, i.e., DO-loop iterations per message
datum. Trials for Ring are each recorded as:


Independent Variables      Dependent Variables (Observables)
n     p     g              o_1(1)=I   o_2(1)=S   o_3(1)=R   o_4(1)=U

Records for Mesh are the same as for Ring. Random omits p, which is fixed at one packet per message, and
introduces e, the exit rate. The exit rate controls how often a message is sent to another randomly-chosen
node. Each such message contains some workload specification (again a random amount). While the
benchmarks B1-B3 can be compared solely on the basis of state occupancies established from one run per
benchmark, the common independent variables (e, n, p, g) support a far broader range of evaluations.
Consequently, each benchmark code is run with numerous parameter settings.


Discussion  and  Results 


The  210  (60 + 75  + 75)  trial  data  records  have  been  reduced  to  a compact  tableau  (shown  below)  of 
four  equations  per  paradigm.  An  excellent  compression  of  840  raw  measurements  (210  trials  * 4 states) 
has  been  made  into  12  short,  parametric  equations  (3  paradigms  * 4 states).  The  reduction  is  70:1  for 
measurement  records  to  equations,  a clear  demonstration  of  the  efficacy  of  the  state-based  approach.  There 
is  a moderate  but  deliberate  loss  of  accuracy.  Since  tolerances  of  ±20%  are  not  unusual  in  everyday 
software  experience,  statistical  analyses  of  residual  terms  were  terminated  near  this  tolerance.  The 
number  of  equations  per  benchmark  corresponds  to  the  number  of  system  states.  Tighter  tolerances  add 
further  terms  to  the  equations,  but  neither  this  nor  additional  data  affect  the  number  of  equations. 
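
How a set of trial records collapses into one short parametric equation can be sketched with an ordinary least-squares line fit. The points below are synthetic and noise-free, generated from the Ring user-time estimator 0.0299 png - 0.380 reported in this paper; they are not the measured trial data, and the original statistical procedure is not reproduced here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical trial grid over nodes n, packets p, grain g.
trials = [(n, p, g) for n in (4, 8, 16) for p in (1, 4, 16) for g in (20, 60, 100)]

# Synthetic user-state occupancies, regressed against the product p*n*g.
xs = [p * n * g for n, p, g in trials]
ys = [0.0299 * x - 0.380 for x in xs]

slope, intercept = fit_line(xs, ys)
print(len(trials), "records ->", round(slope, 4), round(intercept, 3))
```

With real measurements the residuals would be nonzero, and the choice of regressor (here png) is where the modeling judgment lies.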

Mean node states I, S, R, U, and their occupancy estimators lend structure to the table. Columns
sum to elapsed application time. Rows provide comparisons among benchmarks for a selected state. For
example, scaling the system up in nodes, n, shortens receive times, R, for B1 and B2 but lengthens B3's.
Because  the  applications  share  several  independent  parameters,  their  equations  each  generate 
response  surfaces  that  can  be  compared  in  varying  degrees.  The  surfaces  help  determine  the  suitability  of 
benchmarking  regions.  Certainly  parameter  settings  are  suspect  when  they  lie  near  singularities  in  the 
surfaces. 

In  general,  parameter  sets  of  two  distinct  benchmarks  may  vary  widely.  Benchmarks  whose 
parameter  sets  relate  as  strictly  monotone  are  easily  compared  (as  here),  since  a unique  mapping  from  one 
set  of  parameter  values  to  the  other  is  ensured.  The  chosen  family,  B1-B3,  was  designed  from  the  start  to 
explore  system  communication  parameters,  so  parameter  commonality  is  not  accidental.  A weaker 
relationship  between  parameter  sets  indicates  algorithms  with  less  common  structure.  Some  parametric 
dimensions  may  lack  mutual  meaning.  Even  among  the  tailored  set,  parameter  e is  unique  to  Bl.  An 
ad  hoc  assemblage  of  benchmarks  may  have  no  significant  parameter  relationships,  a fact  that  weakens  its 
utility.  What  benchmarks  always  share  are  the  performance  states  for  the  system  of  interest.  Occupancy 
signatures,  at  least,  lie  on  the  same  system  dimensions.  In  worst  case,  a summary  is  a table  of  static 
numerical  entries,  rather  than  a tableau  of  dynamic  parametric  equations.  Yet,  even  the  static  table  has 
considerably  more  consistency  and  detail  than  the  usual  benchmark  results. 




iPSC                                          Application Paradigms
States                B1-Random                        B2-Mesh                 B3-Ring

Comm. Intrpt., I      0.00227 [illegible]/n + 0.040    0.952 p/n + 0.010       0.00760 pn + 0.0660

Send, S               0.0013 [illegible]/n - 0.0188    2.40 p/n + 0.333        0.0114 [illegible] + 0.067

Receive, R            22.6 (g - 100)/(ne) + 2.86       0.619 p/n + 0.300       0.0367 (p + 4)n + 0.296

User, U               0.522 g/n + 0.905                0.089 pg + 40.0 p/g     0.0299 png - 0.380

Estimator T=I+S+R+U   ±15% of measured                 ±10% of measured        ±20% of measured
Accuracy, 95%         elapsed time                     elapsed time            elapsed time
Confidence

Comparing Benchmarks B1-B3 via their State Occupancy Estimators
(sum each column for elapsed time)
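
A row of the tableau can be read as a family of response curves. The sketch below compares the Receive estimators for Mesh and Ring as the node count grows. The Ring expression is legible in the source; the Mesh numerator is damaged in the available copy, and reading it as p (packets per message) is an assumption made for this example.

```python
def mesh_receive(n, p):
    """Mesh receive occupancy; the p/n numerator is an ASSUMED reading."""
    return 0.619 * p / n + 0.300


def ring_receive(n, p):
    """Ring receive occupancy, 0.0367 (p + 4) n + 0.296."""
    return 0.0367 * (p + 4) * n + 0.296


P = 4  # packets per message, held fixed for the comparison
for n in (9, 12, 16):
    print(n, round(mesh_receive(n, P), 3), round(ring_receive(n, P), 3))
```

Under these expressions, adding nodes shortens Mesh receive time but lengthens Ring receive time, which is the row comparison described in the text.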


Problems  in  the  Data.  For  the  three  examples,  the  choice  of  grain  g=0  is  unrepresentative  of  other  grain 
values.  Zero-grained  trials  have  been  excluded.  Another  problem  of  lesser  magnitude  is  the  variation  of 
performance  against  mesh  node  points,  n = 9,  12,  16,  for  Mesh.  A plot  of  residuals  for  this  benchmark 
shows  that  Receive  and  Interrupts  have  distinct  modalities  for  different  numbers  of  processors. 
Nonetheless,  such  variations  have  been  omitted  as  equation  terms  above  because  their  effects  lack 
significance.  However,  identifying  sources  of  uncertainty  encourages  a more  enlightened  use  of  a 
benchmark.  In  particular,  sensitive  parameters  are  identified  and  noted.  On  new  architectures  these 
parameters  should  be  checked,  for  the  magnitude  of  their  contributions  may  differ. 

Estimators  and  Modeling.  Besides  the  powerful  compression  of  information,  there  is  real  advantage  in 
having  estimator  equations  that  capture  benchmark  performances  via  parameters:  Their  structure  supports 
broader  uses  than  are  typical  of  conventionally-reported  results.  Take  for  example  Ring,  whose  user-time 
is  a function  of  grain,  g.  Grain  corresponds  to  application  computation  per  message  packet,  so  that 
conjectures  on  the  effects  of  a user-state  accelerator  or  improved  application  code  can  be  couched  in  terms 
of  g.  The  question  is  natural  to  a user.  Results  appear  below.  The  outcome  from  varying  g is  consonant 
with  that  obtained  from  a NIST  emulation  method;  the  latter  permits  variable  physical  transport  speeds  on 
hypercube  communication  links  [AS90].  In  both  methods,  the  balance  is  changed  between  compute-speed 
and  communication-rate.  Also  with  both,  those  settings  of  Ring  that  provide  speedup  for  one  case  do  little 
in  the  other.  Although  convenient  and  quick,  the  benchmark  method  will  not  always  work,  for  the 
independent  variables  may  be  arranged  poorly  in  the  equations  for  certain  questions. 

[Figure: Relative Improvements for Ring Program, Two Distinct Parameter Settings]
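
The kind of conjecture described above can be sketched by evaluating the Ring estimators with a reduced grain. This is an illustration only: the Send term is damaged in the available copy and is assumed here to be 0.0114 n + 0.067, the two parameter settings are hypothetical, and the numbers produced are not the results shown in the original figure.

```python
def ring_elapsed(n, p, g):
    """Estimated Ring elapsed time T = I + S + R + U."""
    interrupt = 0.00760 * p * n + 0.0660
    send = 0.0114 * n + 0.067            # ASSUMED reading of a garbled term
    receive = 0.0367 * (p + 4) * n + 0.296
    user = 0.0299 * p * n * g - 0.380
    return interrupt + send + receive + user


def relative_improvement(n, p, g, factor):
    """Ratio of elapsed times when effective grain g is divided by factor."""
    return ring_elapsed(n, p, g) / ring_elapsed(n, p, g / factor)


# Two hypothetical settings: compute-heavy versus communication-heavy.
for n, p, g in ((16, 16, 100), (4, 1, 20)):
    print((n, p, g), round(relative_improvement(n, p, g, factor=2.0), 2))
```

Halving g models a user-state accelerator or improved application code. The gain stays below the factor of two because the communication states I, S, and R are unaffected.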


On  a Real  System.  Everyday  use  of  a computer  should  be  rather  removed  from  the  expediencies  of 
experimental  methods  on  test  systems.  Instrumentation  for  a commercial  system  is  certainly  more 
convenient  when  it  is  automatic  within  the  operating  system.  Application  code  is  then  untouched,  and 
users  untroubled  by  collection  details.  An  operational  hypercube  system  can  do  this,  i.e.,  it  can  provide 
the  utility  of  local  collections  (as  discussed  above)  at  very  little  expense  to  user  or  system.  Process 
context-switching  presents  a major  overhead  to  which  measurement  instrumentation  overhead  is  only  a 
small increment. The four iPSC node states I, S, R, and U need a fast system clock and fast registers to
record  occupancies.  Saved  context  must  support  these  registers.  This  is  roughly  twice  as  much  timing 
context  (from  twice  as  many  states)  as  that  which  supports  time-sharing  statistics,  but  it  is  still  slight 
relative  to  the  overall  process  context  that  is  saved.  The  operating  system  collects  and  aggregates  node 
statistics  upon  termination  of  a job. 
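
A software analogue of the per-node timer registers might look like the sketch below. The class `NodeStateTimer`, its methods, and the injected `clock` callable are all hypothetical. They illustrate the bookkeeping only and do not describe the NIST hardware or any iPSC operating system interface.

```python
class NodeStateTimer:
    """Accumulates occupancy time per node state from state-change events."""

    STATES = ("I", "S", "R", "U")

    def __init__(self, clock, initial="U"):
        self._clock = clock              # callable returning the current time
        self._state = initial
        self._entered = clock()
        self.occupancy = {state: 0.0 for state in self.STATES}

    def switch(self, new_state):
        """Charge elapsed time to the state being left, then enter new_state."""
        now = self._clock()
        self.occupancy[self._state] += now - self._entered
        self._state, self._entered = new_state, now

    def stop(self):
        """Close the current interval and return a copy of the totals."""
        self.switch(self._state)
        return dict(self.occupancy)


def mean_occupancies(per_node):
    """Average the per-node totals once the job has terminated."""
    return {state: sum(node[state] for node in per_node) / len(per_node)
            for state in NodeStateTimer.STATES}


# Demonstration with a scripted clock instead of real hardware time.
ticks = iter([0, 5, 7, 10])
timer = NodeStateTimer(clock=lambda: next(ticks))
timer.switch("S")        # time 5: five units charged to U
timer.switch("U")        # time 7: two units charged to S
result = timer.stop()    # time 10: three more units charged to U
print(result)
print(mean_occupancies([result, result]))
```

Only one running timestamp and four accumulators are needed per node, which is consistent with the light context-switch burden described above.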


Local  State  Information  from  Amorphous  Systems 


The  treatment  of  state  is  potentially  more  difficult  whenever  there  are  many  system  factors,  each  with 
distinct  rates.  A static  layout  of  a heterogeneous  program  will  then  determine  a unique  set  of  resource 
requests,  each  of  which  counts  in  determining  performance  variation.  Comparabilities  among  programs 
will  be  difficult,  because  each  will  have  costs  bound  strongly  to  its  program  structure.  The  best  success  is 
when  disparate  applications  can  have  their  resource  demands  reduced  to  a limited  set  of  common  system 
resources.  However,  some  recent  distributed-memory  systems  have  dynamic  allocations.  As  an  example, 
the  Myrias  SPS-2  [PPG90]  presents  a global  view  of  its  address  space,  with  page  faults  generating  fetches 
from  other  processors.  Memory  space  management  is  transparent  to  users.  Tasks  similarly  migrate  from 
one  processor  to  another  to  improve  load  balancing.  The  result  is  a shifting  set  of  execution  costs  that  is 
summarized  for  a job  at  any  point  in  time  via  an  estats  invocation: 

estats(1):    User    System    Wait    Idle

The  items  reported  are  similar  to  those  just  seen  for  the  iPSC-1,  but  factored  more  conventionally. 
System  on  the  Myrias  covers  aspects  of  service  in  the  system-state.  Wait  is  time  spent  awaiting  event 
completions,  e.g.,  page  fetches.  Idle  records  times  when  both  ready  and  blocked  queues  are  empty; 
because  a Myrias  task  is  not  bound  to  a processor,  it  is  possible  for  a processor  to  have  absolutely  nothing 
scheduled.  (The  load-balancing  mechanism  constantly  tries  to  prevent  this.)  The  Myrias  estats  call 
summarizes  many  variable  memory  and  task  costs  that  would  be  too  complex  in  detail.  Constant  "stirring" 
by  system  management  removes  many  concerns  about  homogeneity  or  heterogeneity  of  programs.  As 
with  the  homogeneous  codes  on  the  iPSC,  estats  and  its  statistics  scale  up  with  the  system. 


Summary  and  Conclusion 


A state  occupancy  view  of  performance  removes  time  as  an  overwhelming  concern.  While  transient 
behavior  not  incorporated  explicitly  into  a state  is  lost,  occupancies  still  capture  general  performance 
details.  Emphasis  is  upon  comprehensive  results  that  are  nonetheless  simply  compared.  Summaries  are 
defined  by  the  resolution  of  the  chosen  state-space,  rather  than  the  amount  of  collected  data. 

The  challenge  of  state  occupancies  is  to  move  effectively  from  first  principles  to  actual 
circumstances.  Practical  limitations  almost  always  dictate  simplifications  and  approximate  methods.  This 
explicit  reconciling  of  theory  to  practice  also  identifies  sources  of  measurement  uncertainty  and  error  that 
are  so  often  overlooked  in  benchmarking.  Specific  issues  involve: 


• assuring  that  states  are  succinctly  defined  and  recorded, 

• managing  a large  number  of  states  as  parallel  systems  scale  up  in  number  of  processors, 

• compressing  state  representations, 

• obtaining  global  information  or  circumventing  the  global  state  view. 

The issues are related one to another. Certainly, the discussions on homogeneous applications and
amorphous systems hint at typical tradeoff possibilities. From the system standpoint, trends are toward: (i)
duplicating via VLSI (good, because architecture is more uniform); (ii) decentralizing functions (poor for
ascertaining state); (iii) transparent balancing and referencing (good to counter application bindings). This
assessment assumes an instrumentation geared primarily toward time-based samplings. There will be
ample opportunity to break from this mold as systems increasingly employ by-value communications,
rather than the by-reference of shared-memory. As emphasized above, the very philosophy of state
occupancies is to banish as much of time (e.g., state sequences, algorithmic iterations) as possible. The
important correlations identify which resources handle which demand; exact occurrences in absolute time
are immaterial. Consequently, the use of tagged messages would seem to be a natural matching of the
strengths of distributed systems and the relaxed requirements of state occupancies. The challenge would
be  to  keep  the  tagging  independent  of  specific  applications.  Otherwise,  comparisons  among  applications 
would  be  very  limited. 

Conclusion. Restricting comparisons among applications to either (i) their signatures of system service,
or (ii) parametric sets of signatures, is far simpler and more readily accessible than detail-by-detail
algorithmic comparisons. Relying principally upon measurement, state occupancies present an attractive
complement to other more analytic and specialized methods.